{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 语言模型\n",
    "\n",
    "### 自然语言统计"
   ],
   "id": "28a5592bda26e5aa"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "根据time_machine数据集构建词表， 并打印前10个最常用的（频率最高的）单词",
   "id": "f2e2c0e0cd2041a8"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T08:49:16.593212Z",
     "start_time": "2024-08-27T08:49:14.992168Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import random\n",
    "import torch\n",
    "from d2l import torch as d2l\n",
    "import re\n",
    "\n",
    "\n",
    "def read_time_machine():  # @save\n",
    "    \"\"\"将时间机器数据集加载到文本行的列表中\"\"\"\n",
    "    with open('time_machine.txt', 'r') as f:\n",
    "        lines = f.readlines()\n",
    "    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]\n",
    "\n",
    "\n",
    "tokens = d2l.tokenize(read_time_machine())\n",
    "# 因为每个文本行不一定是一个句子或一个段落，因此我们把所有文本行拼接到一起\n",
    "corpus = [token for line in tokens for token in line]\n",
    "vocab = d2l.Vocab(corpus)\n",
    "vocab.token_freqs[:10]"
   ],
   "id": "initial_id",
   "execution_count": 1,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "正如我们所看到的，高频词词看起来很无意义， 这些词通常被称为*停用词*（stop words），因此可以被过滤掉。 尽管如此，它们本身仍然是有意义的，我们仍然会在模型中使用它们。 此外，还有个明显的问题是词频衰减的速度相当地快。 例如，最常用单词的词频对比，第10个还不到第1个的1/5。 为了更好地理解，我们可以画出的词频图：",
   "id": "5f81d8ce6490ff43"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "freqs = [freq for token, freq in vocab.token_freqs]\n",
    "d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)', xscale='log', yscale='log')"
   ],
   "id": "d0a9a52802f27728",
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "二元语法",
   "id": "1d96cc6a46c12d79"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T08:53:15.033592Z",
     "start_time": "2024-08-27T08:53:14.997364Z"
    }
   },
   "cell_type": "code",
   "source": [
    "bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]\n",
    "bigram_vocab = d2l.Vocab(bigram_tokens)\n",
    "bigram_vocab.token_freqs[:10]"
   ],
   "id": "2cb0481c7a6666a9",
   "execution_count": 3,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "三元语法：这里值得注意：在十个最频繁的词对中，有九个是由两个停用词组成的， 只有一个与“the time”有关。 我们再进一步看看三元语法的频率是否表现出相同的行为方式。",
   "id": "6687a757325e2f6"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T08:54:26.526523Z",
     "start_time": "2024-08-27T08:54:26.498560Z"
    }
   },
   "cell_type": "code",
   "source": [
    "trigram_tokens = [triple for triple in zip(corpus[:-2], corpus[1:-1], corpus[2:])]\n",
    "trigram_vocab = d2l.Vocab(trigram_tokens)\n",
    "trigram_vocab.token_freqs[:10]"
   ],
   "id": "4b2c66e7d7ef7efa",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "最后，我们直观地对比三种模型中的词元频率：一元语法、二元语法和三元语法。",
   "id": "e24d7102effe32c7"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T08:55:31.497568Z",
     "start_time": "2024-08-27T08:55:31.270895Z"
    }
   },
   "cell_type": "code",
   "source": [
    "bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]\n",
    "trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]\n",
    "d2l.plot([freqs, bigram_freqs, trigram_freqs],\n",
    "         xlabel='token: x',\n",
    "         ylabel='frequency: n(x)',\n",
    "         xscale='log',\n",
    "         yscale='log',\n",
    "         legend=['unigram', 'bigram', 'trigram'])"
   ],
   "id": "715dfff145910899",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 读取长序列数据\n",
    "\n",
    "- 随机采样\n",
    "- 顺序分区"
   ],
   "id": "9fa5e47901deec41"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "**随机采样**\n",
    "\n",
    "下面的代码每次可以从数据中随机生成一个小批量。 在这里，参数`batch_size`指定了每个小批量中子序列样本的数目， 参数`num_steps`是每个子序列中预定义的时间步数。"
   ],
   "id": "68b2ca83b330e69"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:07:48.758063Z",
     "start_time": "2024-08-27T09:07:48.729352Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def seq_data_iter_random(corpus, batch_size, num_steps):  #@save\n",
    "    \"\"\"使用随机抽样生成一个小批量子序列\"\"\"\n",
    "    # 从随机偏移量开始对序列进行分区，随机范围包括num_steps-1\n",
    "    corpus = corpus[random.randint(0, num_steps - 1):]\n",
    "    # 减去1，是因为我们需要考虑标签\n",
    "    num_subseqs = (len(corpus) - 1) // num_steps\n",
    "    # 长度为num_steps的子序列的起始索引\n",
    "    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))\n",
    "    # 在随机抽样的迭代过程中，\n",
    "    # 来自两个相邻的、随机的、小批量中的子序列不一定在原始序列上相邻\n",
    "    random.shuffle(initial_indices)\n",
    "\n",
    "    def data(pos):\n",
    "        # 返回从pos位置开始的长度为num_steps的序列\n",
    "        return corpus[pos: pos + num_steps]\n",
    "\n",
    "    num_batches = num_subseqs // batch_size\n",
    "    for i in range(0, batch_size * num_batches, batch_size):\n",
    "        # 在这里，initial_indices包含子序列的随机起始索引\n",
    "        initial_indices_per_batch = initial_indices[i: i + batch_size]\n",
    "        X = [data(j) for j in initial_indices_per_batch]\n",
    "        Y = [data(j + 1) for j in initial_indices_per_batch]\n",
    "        yield torch.tensor(X), torch.tensor(Y)"
   ],
   "id": "e12daf18bc5f74bd",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "下面我们生成一个从0到34的序列。 假设批量大小为2，时间步数为5，这意味着可以生成 ⌊(35−1)/5⌋=6个 “特征－标签”子序列对。 如果设置小批量大小为2，我们只能得到3个小批量。",
   "id": "1b1d111fb6a98612"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:08:52.858844Z",
     "start_time": "2024-08-27T09:08:52.822972Z"
    }
   },
   "cell_type": "code",
   "source": [
    "my_seq = list(range(35))\n",
    "for X, Y in seq_data_iter_random(my_seq, batch_size=2, num_steps=5):\n",
    "    print('X: ', X, '\\nY:', Y)"
   ],
   "id": "bbad4df0e7b35601",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "**顺序分区**\n",
    "\n",
    "在迭代过程中，除了对原始序列可以随机抽样外， 我们还可以保证两个相邻的小批量中的子序列在原始序列上也是相邻的。 这种策略在基于小批量的迭代过程中保留了拆分的子序列的顺序，因此称为顺序分区。"
   ],
   "id": "ef01ffd16a170cbc"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:11:23.784103Z",
     "start_time": "2024-08-27T09:11:23.777272Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save\n",
    "    \"\"\"使用顺序分区生成一个小批量子序列\"\"\"\n",
    "    # 从随机偏移量开始划分序列\n",
    "    offset = random.randint(0, num_steps)\n",
    "    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size\n",
    "    Xs = torch.tensor(corpus[offset: offset + num_tokens])\n",
    "    Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])\n",
    "    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)\n",
    "    num_batches = Xs.shape[1] // num_steps\n",
    "    for i in range(0, num_steps * num_batches, num_steps):\n",
    "        X = Xs[:, i: i + num_steps]\n",
    "        Y = Ys[:, i: i + num_steps]\n",
    "        yield X, Y"
   ],
   "id": "2714196d947bad15",
   "execution_count": 9,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "基于相同的设置，通过顺序分区读取每个小批量的子序列的特征X和标签Y。 通过将它们打印出来可以发现： 迭代期间来自两个相邻的小批量中的子序列在原始序列中确实是相邻的。",
   "id": "1c6207b23269fb13"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:11:49.726234Z",
     "start_time": "2024-08-27T09:11:49.710546Z"
    }
   },
   "cell_type": "code",
   "source": [
    "for X, Y in seq_data_iter_sequential(my_seq, batch_size=2, num_steps=5):\n",
    "    print('X: ', X, '\\nY:', Y)"
   ],
   "id": "31d94969334c4d52",
   "execution_count": 10,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "**采样函数包装到一个类**\n",
    "\n",
    "现在，我们将上面的两个采样函数包装到一个类中， 以便稍后可以将其用作数据迭代器。"
   ],
   "id": "e7babbc5be10d82f"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:13:06.488937Z",
     "start_time": "2024-08-27T09:13:06.483350Z"
    }
   },
   "cell_type": "code",
   "source": [
    "class SeqDataLoader:  #@save\n",
    "    \"\"\"加载序列数据的迭代器\"\"\"\n",
    "\n",
    "    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):\n",
    "        if use_random_iter:\n",
    "            self.data_iter_fn = d2l.seq_data_iter_random\n",
    "        else:\n",
    "            self.data_iter_fn = d2l.seq_data_iter_sequential\n",
    "        self.corpus, self.vocab = d2l.load_corpus_time_machine(max_tokens)\n",
    "        self.batch_size, self.num_steps = batch_size, num_steps\n",
    "\n",
    "    def __iter__(self):\n",
    "        return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)"
   ],
   "id": "5375f712a8e890eb",
   "execution_count": 12,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "最后，我们定义了一个函数`load_data_time_machine`， 它同时返回数据迭代器和词表。",
   "id": "46533bf23aeee9a4"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-27T09:14:42.419799Z",
     "start_time": "2024-08-27T09:14:42.413671Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def load_data_time_machine(batch_size, num_steps, use_random_iter=False, max_tokens=10000):\n",
    "    \"\"\"返回时光机器数据集的迭代器和词表\"\"\"\n",
    "    data_iter = SeqDataLoader(\n",
    "        batch_size, num_steps, use_random_iter, max_tokens)\n",
    "    return data_iter, data_iter.vocab"
   ],
   "id": "1e9e057b4f2baafe",
   "execution_count": 13,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "execution_count": null,
   "source": "",
   "id": "339b544c1f87f712",
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
